# The Landscape of GPU-Centric Communication

DIDEM UNAT, Koç University, Turkey
ILYAS TURIMBETOV, Koç University, Turkey
MOHAMMED KEFAH TAHA ISSA, Koç University, Turkey
DOĞAN SAĞBILI, Koç University, Turkey
FLAVIO VELLA, University of Trento, Italy
DANIELE DE SENSI, Sapienza University of Rome, Italy
ISMAYIL ISMAYILOV, Koç University, Turkey

In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their parallelism and fast memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks, especially as the number of GPUs per node and cluster grows. Traditionally, the CPU managed multi-GPU communication, but advancements in GPU-centric communication now challenge this CPU dominance by reducing its involvement, granting GPUs more autonomy in communication tasks, and addressing mismatches in multi-GPU communication and computation.

This paper provides a landscape of GPU-centric communication, focusing on vendor mechanisms and user-level library supports. It aims to clarify the complexities and diverse options in this field, define the terminology, and categorize existing approaches within and across nodes. The paper discusses vendor-provided mechanisms for communication and memory management in multi-GPU execution and reviews major communication libraries, their benefits, challenges, and performance insights. Then, it explores key research paradigms, future outlooks, and open research questions. By extensively describing GPU-centric communication techniques across the software and hardware stacks, we provide researchers, programmers, engineers, and library designers insights on how to exploit multi-GPU systems at their best.

 $CCS\ Concepts: \bullet\ Networks \rightarrow Programming\ interfaces; \bullet\ Computing\ methodologies \rightarrow Parallel\ programming\ languages; \bullet\ Hardware \rightarrow Communication\ hardware,\ interfaces\ and\ storage; \bullet\ Computer\ systems\ organization \rightarrow Single\ instruction,\ multiple\ data.$ 

Additional Key Words and Phrases: GPUs, communication, MPI, NVSHMEM, NCCL, RCCL, Peer-to-peer Communication, Collective Communication, GPUDirect Technologies.

## **ACM Reference Format:**

Authors' addresses: Didem Unat, Koç University, Istanbul, Turkey, dunat@ku.edu.tr; Ilyas Turimbetov, Koç University, Istanbul, Turkey, iturimbetov18@ku.edu.tr; Mohammed Kefah Taha Issa, Koç University, Istanbul, Turkey, MISSA18@ku.edu.tr; Doğan Sağbili, Koç University, Istanbul, Turkey, dsagbili17@ku.edu.tr; Flavio Vella, University of Trento, Trento, Italy, flavio.vella@unitn.it; Daniele De Sensi, Sapienza University of Rome, Rome, Italy, desensi@di. uniroma1.it; Ismayilov, Koç University, Istanbul, Turkey, iismayilov21@ku.edu.tr.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

©~2024 Association for Computing Machinery.

Manuscript submitted to ACM

#### 1 INTRODUCTION

In recent years, GPUs have become the accelerator of choice for a vast array of applications across both HPC and ML. This quick adoption, driven mainly by the GPU's massive parallelism and fast memory bandwidth, means that most of modern cloud and HPC capability is now concentrated in clusters of GPUs. As of June 2024, 9 of the 10 leading Top500 supercomputers rely on GPU clusters for acceleration and this trend is likely to continue [58]. The only one that does not rely on GPUs relies on highly-vectorized CPU architecture with High-Bandwidth Memory.

While using scores of GPUs has been shown to significantly accelerate computation, communication between the GPUs can quickly become a scalability bottleneck [111]. Traditionally, multi-GPU communication, both within and across nodes, had always been the responsibility of the CPU. From their conception, GPUs were thought of as devices that can supply large amounts of computation but that are also inherently dependent on the CPU for auxiliary tasks like communication. In this CPU-centric model of execution, the routines relaying data for consumption by the GPUs are oblivious to the GPUs' existence.

In the last decade, several advancements, broadly referred to as *GPU-centric* communication, have sought to challenge the CPU's hegemony on multi-GPU execution. At a high level, these advancements reduce the CPU's involvement in the critical path of execution, give the GPU more autonomy in initiating and synchronizing communication and attempt to address the semantic mismatch between multi-GPU communication and computation.

In this paper, we present a comprehensive review of *GPU-centric* communication with respect to vendor mechanism and user-level library supports. Our goal with this survey is to help allay the general state of confusion that arises when a prospective researcher begins wading into the field. We hope to help programmers, engineers, programming model and library designers understand the complexity and diversity of available options because *GPU-centric* communication spans a very broad spectrum of approaches, including hardware innovations like proprietary GPU-to-GPU interconnects and software mechanisms. These mechanisms have distinct benefits and challenges making it unclear when and where they should be preferred. The picture is made even murkier by inconsistent terminology used across literature and vendor differences in offerings.

We organize the paper as follows:

- In Section 2, we define the terminology, provide a definition for *GPU-centric* communication and taxonomize the existing approaches within a node and across nodes to eliminate any confusion.
- In Section 3, we discuss the vendor-provided mechanisms for enabling communication and networking, managing
  memory across devices in a multi-GPU execution. These mechanisms are used as building blocks for higher-level
  GPU-centric software libraries.
- In Section 4, we list the main communication libraries for both *intra-* and *inter-node* setups, discuss their benefits and challenges and rely on existing benchmarking papers to provide insights into their performance.
- In Section 5, we discuss the main research paradigms underlying *GPU-centric* communication, provide an outlook on the field and present open research questions.

We also acknowledge that certain methods and technologies depend on proprietary ecosystems. In this work, we primarily focus on NVIDIA solutions. However, similar solutions are offered by AMD, and we explicitly highlight those that are only offered by one of the two vendors. Intel also offers high-end HPC/ML GPUs. However, at the time of writing, there is not much public information available about the outlook of the product line. Intel Ponte Vecchio is not going to be deployed on new clusters [115], and the next generation Falcon Shores GPUs are not going to be released before late 2025 [105].

| Type              | API    | Data Path | Examples                                           |
|-------------------|--------|-----------|----------------------------------------------------|
| 1 Host native     | Host   | Host      | cuda/hipMemcpy and other host-side comm. (No P2P)  |
| 2 Host-controlled | Host   | Device    | cuda/hipMemcpy and other host-side comm. (P2P)     |
| ③ Device native   | Device | Device    | NVSHMEM/ROCSHMEM GPU-side, direct load/store (P2P) |
| 4 Host fallback   | Device | Host      | Direct load/store (No P2P)                         |

Table 1. Types of intra-node communication methods



Fig. 1. Data paths and API calls of intra-node communication methods

#### 2 TERMINOLOGY AND COMMUNICATION TYPES

We can loosely define *GPU-centric communication* as *mechanisms that* reduce the involvement of the *CPU* in the critical path of multi-GPU execution. This is a very broad definition covering a wide spectrum of solutions, involving both the vendor-level improvements that grant GPUs autonomy in communication and user-level implementations that leverage those improvements. To make this distinction clear, we discuss them in separate sections. In Section 3, we focus on communication mechanisms and primitives provided natively as part of the NVIDIA CUDA and AMD ROCm runtimes. In Section 4, we discuss how those mechanisms give rise to higher-level GPU-centric communication libraries.

We also point out the distinction between *communication* within a node (intra-node) and communication across nodes (inter-node). A single GPU-accelerated node comprises a single shared memory host with multiple GPU cards attached. When communicating within a node, any given GPU can be controlled by a single thread or process, with a shared memory and address space. A multi-node system has multiple such nodes where a different process controls each GPU, and memory is not shared between processes running on different nodes. The communication landscape changes depending on the setup used, since inter-node communication requires handling the GPU-NIC interaction and across-process communication.

Despite classification of communication methods into GPU- and CPU-side is commonly used and can be sufficient for an end-user, it is not always explanatory and accurate. To avoid the vagueness of definitions, we divide the communication methods into several types. These types are based on the executor of each of the operations performed during communication. We define two main operations needed for the communication to take place in the intra-node scenario and four operations for the inter-node one. In the intra-node case, the two components of a communication call are:

- **API**. Defines where the communication API call is made by the programmer or library.
- Data path. Indicates who participates in the data movement and shows the corresponding data path.

The examples showing the classification of intra-node communication mechanism, together with figures depicting the data paths are shown on Table 1 and Figure 1.

| Type                 | API    | Register | Trigger | Data path       | Examples             |
|----------------------|--------|----------|---------|-----------------|----------------------|
| 1 Host native        | Host   | Host     | Host    | Through host (2 | GPU-Aware MPI before |
| 1) Host mative       |        |          |         | copies)         | 2                    |
| (2) GPUDirect 1.0    | Host   | Host     | Host    | Through host (1 | GPU-Aware MPI, NCCL  |
| 2) Gr Obliect 1.0    |        |          |         | copy)           |                      |
| 3 GPUDirect RDMA,    | D/H    | Host     | Host    | Direct          | GPU-Aware MPI, NCCL, |
| ROCnRDMA             | D/11   | 11051    | 11081   | Direct          | NVSHMEM              |
| (4) GPUDirect Async  | D/H    | Host     | Device  | Depends on ③    | GPU-Aware MPI, NVSH- |
| 4) Gr Obliect Asylic |        |          |         |                 | MEM                  |
| (5) Device native    | Device | Device   | Device  | Depends on ③    | NVSHMEM with IBGDA,  |
|                      |        |          |         |                 | GPUrdma              |

Table 2. Types of inter-node communication methods. Cells in bold refer to where a change or optimization has been made. *D/H* means that both device-side and host-side API calls may belong to this type.



Fig. 2. Inter-node communication data and control paths.

The communication method in ①, referred as host native, is made on the host side and does not involve direct P2P (peer-to-peer) access between the devices. These involve all methods that can be launched on the host side with P2P access disabled. Otherwise, as ② (host-controlled) shows, the communication does not involve an extra copy to the host memory, and passes directly through PCIe, NVLink or Infinity Fabric interconnects. NCCL, GPU-aware MPI and \*memcpy operations have host-side API, so they may belong to both ① and ②. With direct access to peer devices memory the device-side API allowed to remove the CPU both from the data path and control path in intra-node communication, as ③, referred as device native, shows on Figure 1. Both NVSHMEM and ROCSHMEM offer host-side API as well, but require P2P access, so its host-side API belongs to ② and device-side API is of type ③. In-kernel P2P direct load and stores offer similar functionality and belong to type ③, but work even with P2P access disabled, which is type ④, where data path falls back to the host.

The inter-node scenarios are more diverse since interaction with the NIC has to take place. The details of each method's implementations may involve complicated data paths and decision making. We distinguish four main components of inter-node communication for a simpler classification. Apart from the API and the data path (to the NIC) used in the intra-node scenario, there are two additional components involving interaction with the NIC:

- Register/construct messages. This step involves the construction of data packets and their registration on the NIC.
- Trigger communication. It defines who rings the doorbell on the NIC to issue data transfer.

We identify five main categories when classifying inter-node communication. From 1 to 5, as Table 2 and Figure 2 show, more components of communication calls were moved to the GPU side, while the data transfer path saw a reduction in the number of data copies required to reach the NIC. Over the years, the communication methods and the Manuscript submitted to ACM

corresponding technologies depict the optimizations of both data transfer and communication control, which will be discussed in Section 3. First, entirely CPU-side communication method in ① was available, which has been improved by removal of extra copy between CPU-GPU and CPU-NIC buffers in ②. After that, a GPU-side optimization facilitated direct access to the GPU memory from NIC ③ minimizing the data path between them. ④ represents GPU-triggered communication technologies such as GPUDirect Async and GPU-TN [53], where the GPU has become capable to initiate communications, given that CPU prepared the packets on the NIC in advance. ⑤ moves the packet preparation and interaction with the NIC entirely to the GPU as well, making device-native communication possible.

The types given in Table 2 and Figure 2 do not reflect all possible combinations, since some libraries, based on the configuration and the available hardware may lead to different combinations of data paths and control. For example, without the RDMA technology, even with GPU-side communication control, the data path will involve the host memory.

#### 3 VENDOR MECHANISMS

In this section we discuss the vendor-provided mechanisms for enabling communication and networking, managing memory across devices in a multi-GPU execution. These mechanisms are provided by GPU programming model runtimes or as part of the extended APIs.



Fig. 3. Timeline of NVIDIA technologies enabling GPU-centric communication and networking.

Figure 3 summarizes the technologies provided by NVIDIA, detailing their timeline and availability. The technologies are classified into four categories: memory managers, GPUDirect technologies, hardware, and libraries. Next, we introduce the memory management mechanisms and GPUDirect technologies, followed by the hardware support that served as precursors and ultimately made these communication methods viable. These technologies form the backbone of the higher-level GPU-centric libraries, which are discussed in Section 4.

#### 3.1 Memory Management Mechanisms

3.1.1 Page-Locked / Pinned Memory. By default, memory allocated on the host using cudaMalloc() is pageable and is not accessible to the GPU. When a transfer between pageable host memory and device memory is performed, the GPU runtime must first stage the host data through a temporary buffer in page-locked memory and then copy the data from page-locked memory to the GPU. To avoid the pageable → page-locked memory copy, cudaMallocHost() allows

Manuscript submitted to ACM

allocating page-locked memory directly, skipping the intermediate copy stage. Because of this, page-locked memory is also referred to as **zero-copy** or **pinned** memory [39, 72].

Pinned memory, known for its high bandwidth and low latencies in host-device transfers [93], efficiently coordinates CPU-GPU execution by enabling direct system-wide access. It is also utilized with GPUDirect RDMA for improved inter-node communication [55]. However, its physical memory locking can lead to significant memory consumption, potentially impacting system performance with excessive allocations [39].

- 3.1.2 Unified Virtual Addressing (UVA). UVA is a memory management technique introduced in CUDA 4.0 which allows all GPUs and CPUs within a node to share the same unified virtual address space [66, 72]. Prior to UVA, host  $\leftrightarrow$  device and device  $\leftrightarrow$  device copies had to explicitly specify the direction of transfer. With UVA, the physical memory location can be inferred from pointer values, thus, reducing the overhead of managing separate memory spaces and enabling libraries to simplify their interfaces [103].
- 3.1.3 **CUDA IPC.** In early CUDA versions, pointers could not be accessed across process boundaries, so memory copies between GPU buffers had to go through the host, creating a bottleneck. To overcome this limitation, CUDA 4.1 introduced CUDA Inter-Process Communication (IPC), which enables processes on the same node to access device buffers of other processes without additional copies [69]. With CUDA IPC, memory handles are created and passed between processes using standard IPC mechanisms, resulting in lower latencies than staging copies through the host. However, the overhead of creating memory handles can be significant and may offset the latency benefits [98].
- 3.1.4 Unified Virtual Memory (UVM). Introduced in CUDA 6.0, UVM allows for the allocation of managed memory through cudaMallocManaged() calls by creating a single address space accessible to all processors within a single node. UVM works by dividing the requested memory into pages that are resident on the CPU. The programmer can access memory on a device without explicit copies. If a memory access is part of a page that is not on the device, the UVM-driven triggers a page-fault that automatically migrates the page to the requesting device. The UVM driver can also evict pages from a given device back to host memory when the total paged memory size exceeds device memory [72].

UVM provides several benefits in regard to programmability. First, programmers are exposed to a single unified address space that they can access as if the whole allocated chunk of memory is resident on a single GPU. Any copies occurring around the system are implicit and hidden from the programmer's view. Additionally, UVM allows memory oversubscription whereby more memory can be allocated than all the multi-GPU device memory combined. This is possible since most of the memory can stay on the CPU and be paged in whenever a given device requests it [71, 109].

## 3.2 GPUDirect Technologies

3.2.1 **GPUDirect 1.0 (NIC)**. Introduced in CUDA 3.1, GPUDirect 1.0 allowed GPUs and NICs to share the same pinned memory region. Prior, the pinned memory regions in system memory for GPUs and the NIC were separate. By implication, to communicate GPU data across nodes, the GPU first copied the data to its pinned memory region, the CPU then copied it to the NIC's memory region, only then it can be accessed by the NIC, which sent it across the network, as shown in Figure 2 (1). The intermediate CPU-initiated copy from GPU  $\rightarrow$  NIC pinned memory regions adds CPU overhead and increases the latency for GPU communication. GPUDirect 1.0 introduced a shared memory GPU-NIC pinned memory region, thus, avoiding the intermediate CPU-initiated copy [102, 107].

3.2.2 **GPUDirect 2.0 (Peer-to-Peer).** Along with the introduction of UVA, the CUDA 4.0 release added support for direct peer-to-peer communication among GPUs in a single node that share the same PCIe root complex [66]. This functionality was encapsulated in a technology known as GPUDirect 2.0 or GPUDirect P2P. Instead of staging data through the host, GPUs could now directly access each other's memory over PCIe, establishing, for the first time, a direct GPU-to-GPU data path. These changes led to two new communication mechanisms: *P2P DMA Copies* whereby a *cudaMemcpy* call would trigger a DMA transfer directly between source and target GPU memories and *P2P Direct Load / Stores* using which the GPUs could directly access data by dereferencing pointers to the remote GPU buffers. GPUDirect P2P also added support for NVLink (Section 3.4) when the latter technology was introduced [67, 78, 102].

GPUDirect P2P provided two main benefits. It eliminated redundant GPU  $\leftrightarrow$  CPU copies and host buffers, which were required when the transfers were staged through the CPU. Also, by eliding the need to maintain communication buffers on the host and providing a new communication mechanism (P2P Direct Load / Stores), GPUDirect P2P increased the convenience of multi-GPU programming [67].

We note that P2P DMA Copies can also work without UVA support. If UVA is not enabled, P2P DMA Copies can be performed using the *cudaMemcpyPeer()* variants by explicitly specifying the target GPU. However, P2P Direct Load / Stores will not work without UVA as directly accessing a remote GPU's pointer presumes a unified address space [72].

3.2.3 **GPUDirect RDMA**. With the introduction of GPUDirect RDMA in CUDA 5.0, direct communication between NVIDIA GPUs across nodes became feasible. GPUDirect RDMA facilitates a direct communication channel between GPUs and third-party devices through standard PCIe features. The technology exposes segments of GPU memory on the PCIe memory resource, referred to as the Base Address Register (BAR) region. This enables NICs to directly read/write GPU memory without routing through the host [75]. Analogously, AMD offers ROCm RDMA (previously called ROCnRDMA) [7, 9].

GPUDirect RDMA provides several optimizations to the data path, namely by eliminating additional copies to host memory, reducing inherent latencies stemming from GPU-NIC interaction, increasing bandwidth and reducing CPU overhead. Support for GPUDirect RDMA has been integrated into several leading communication libraries including GPU-aware MPI implementations, NCCL and NVSHMEM.

3.2.4 **GPUDirect Async**. While previous GPUDirect technologies focused on improving the data path, GPUDirect Async optimizes the control path between the GPU and the NIC. Introduced in CUDA 8.0, it enables GPUs to initiate and synchronize network transfers, thereby reducing the CPU's involvement in the critical path. GPUDirect Async works by having the CPU pre-register messages, which the GPU kernel can then trigger by ringing a doorbell on the NIC. As a result, the GPU can continue executing while the communication is being triggered, rather than needing to stop for the CPU to initiate the communication, as was previously necessary [1, 2].

Although GPUDirect Async has led to improvements in efforts to move the control path away from the CPU, it still does not completely transfer the control path to the GPU since communication is limited to kernel launch boundaries. Essentially, the GPU can only initiate messages previously registered by the CPU. Further improvements to GPUDirect Async are implemented as part of the IBGDA transport in the NVSHMEM library (Section 4.3.1).

## 3.3 GPUNetIO

GPUNetIO [16] is a technology solution proposed by NVIDIA as part of DOCA (Datacenter-On-a-Chip Architecture) [27]. DOCA is a full-stack software framework designed to facilitate the development of applications for NVIDIA BlueField Data Processing Units (DPUs) [18]. On non-RDMA networks GPUNetIO allows the GPU to send, receive, and process Manuscript submitted to ACM

network packets [16, 17]. On RDMA networks (both RoCE and InfiniBand) [27], from DOCA v2.7, GPUNetIO allows the GPU to execute RDMA send and receive *not only on kernel boundaries*, but at any point during the kernel execution. In a nutshell, GPUNetIO allows the GPU to interact with the NIC without any CPU intervention on the critical path.

On RDMA networks, the GPU kernel can wait (in blocking or nonblocking mode) for the completion of RDMA receive operations. On non-RDMA networks, GPUNetIO provides semaphores that can be used explicitly within the kernel for synchronization with the NIC when sending and receiving packets. Semaphores can also be used to synchronize GPU kernels with the CPU (in case the packet processing is split between the CPU and the GPU) or with other CUDA kernels (if the packet processing is split across multiple kernels).

#### 3.4 Modern GPU-centric Interconnects

NVLink is a proprietary interconnect technology that facilitates high bandwidth and low latency direct access between NVIDIA GPUs. Its design addresses the bandwidth limitations of PCIe, which has been observed to be a transfer bottleneck in GPU-accelerated applications [55, 70]. Table 3 presents the specifications of each generation of NVLink [34, 55, 79].

| Generation | Number of<br>NVLink slot | Per direction                      | Total Aggregate | Supported<br>Architecture |
|------------|--------------------------|------------------------------------|-----------------|---------------------------|
|            |                          | Bandwidth per                      | Bi-Directional  |                           |
|            |                          | NVLink (GB/sec) Bandwidth (GB/sec) |                 | Architecture              |
| First      | 4                        | 20                                 | 160             | Pascal                    |
| Second     | 6                        | 25                                 | 300             | Volta                     |
| Third      | 12                       | 25                                 | 600             | Ampere                    |
| Fourth     | 18                       | 25                                 | 900             | Hopper                    |
| Fifth      | 18                       | 50                                 | 1800            | Blackwell                 |

Table 3. NVLink Generation Specifications

Additionally, NVLink was also used to connect GPUs with the CPU for IBM Power8 and Power9 CPUs but with the introduction to Grace Hopper Superchip, NVLink is used as a Chip-to-Chip (C2C) interconnect with 900 GB/sec Bi-directional bandwidth [55, 83]. Later with the introduction of Grace Blackwell Superchip, NVLink-C2C is used for connecting Grace CPU with 2 Blackwell GPUs with a total of 3.6 TB/sec bidirectional Bandwidth [82]. Fifth-generation NVLink on NVIDIA Blackwell delivers 1.8TB/s bidirectional throughput per GPU, providing high-speed communication among up to 576 GPUs.

While direct GPU-to-GPU P2P communication had already been established over PCIe with GPUDirect P2P, it was heavily bottlenecked by the low bandwidths of PCIe. The introduction of NVLink optimized the bandwidth between NVIDIA GPUs, turning P2P communication into a viable mechanism for intra-node communication and shifting the *data path* heavily in favor of GPUs. A disadvantage of NVLink is that it is not self-routed meaning that if any two given GPUs do not have a direct NVLink connection communication will have to be routed through an intermediate GPU [55]. This limitation is overcome by NVSwitch [79], a backboard technology that can implement all-to-all connections between all GPUs. As an example, a DGX-2 node consists of 16 V100 GPUs that are all-to-all connected through NVLink and NVSwitch [74]. Starting from the third generation, NVSwitch supports SHARP [84], which offloads allreduce operations to NVSwitch, allowing allreduce to operate at full line rate [41].

AMD also offers a proprietary interconnect technology called xGMI/Infinity Fabric. However, it currently lacks a technology comparable to NVSwitch. While up to 8 MI300X GPUs can be connected in a full all-to-all mesh, NVSwitch Manuscript submitted to ACM

can connect up to 64 GPUs. Recently, the Ultra Accelerator Link (UALink) consortium was established to develop a more open shared memory accelerator interconnect, compatible with multiple technologies and vendors [60].

### 3.5 Discussion on Vendor Mechanisms

3.5.1 Impact of GPUDirect P2P and Direct Load/Store on Programming. The introduction of GPUDirect P2P marked a significant shift in the paradigm of multi-GPU execution, enabling direct communication between GPUs using load and store operations from within the kernel. Direct Load/Store-based communication offers several benefits. First, it allows the programmer to inline communication with computation, potentially reducing code complexity and improving programmer productivity. The programmer no longer has to rely on separate models for communication and computation and can instead combine them within the GPU kernel [51, 95, 97]. Second, Direct Load/Stores utilize the high levels of parallelism offered by the GPU and can achieve higher levels of bandwidth and lower latencies compared to DMA copies [14, 92]. Third, Direct Load/Stores can implicitly overlap communication with computation through the GPU's inherent latency hiding capabilities. Given both the high levels of parallelism granted by the GPU and the increasing bandwidth numbers offered by modern interconnects, the GPU has the capability to hide latencies not only to local but remote memory as well [94, 95, 97]. This is another boon for the programmer as the method of achieving overlap is shifted from a manual software-based approach implemented by the programmer through streams and events to an automatic hardware-based overlap. Since the onus of communication/computation overlap is passed from the programmer to the hardware, another implication is that the overlap will improve as the hardware gets better at hiding memory latencies. Fourth, Direct Load / Stores expand the scope of applications that could be accelerated through multiple GPUs. Traditionally, applications with fine-grained communication patterns achieved poor scalability on multi-GPU systems as computation frequently had to be interrupted and synchronized in order for the CPU to initiate communication. With Direct Load / Stores from within the kernel, GPUs can adapt well to fine-grained communication patterns.

Despite the improvements conferred by Direct Load / Stores, there are several inherent challenges. First, a fundamental challenge is that communication and computation contend for the same limited resource as they now both require large volumes of GPU threads to make progress. This can be especially problematic when communication is implemented as a separate kernel. If the computation kernel is launched first, it can potentially monopolize all GPU resources preventing the communication kernel from being launched, effectively, eliminating any possibility of overlap. It is possible to alleviate this issue by launching the communication stream with a higher priority so that it is always scheduled first. We note that P2P DMA Copies do not have this issue as they use the GPU's DMA / Copy Engines - a physically separate resource - for communication [15]. Second, similar to single-GPU memory accesses, P2P Direct Load / Stores are highly sensitive to memory coalescing with random non-coalesced access performing far worse than coalesced accesses [14]. Such non-coalesced Direct Reads may expose remote memory latencies which are beyond the GPU scheduler's ability to hide, eventually, stalling execution. On a similar note, sporadic non-coalesced Direct Writes at sub-cacheline granularities may dramatically underutilize the interconnect [62].

3.5.2 **Limitation of GPUDirect RDMA**. A significant limitation of GPUDirect RDMA is that there are no guarantees of consistency between GPU and NIC memories *while a kernel is running*. Consistency is guaranteed only by returning control to the CPU by tearing down the kernel and launching a new kernel, thus, limiting communication to kernel boundaries. This also implies that combining persistent kernels with GPU-initiated inter-node communication will inevitably lead to data correctness issues [75]. Chu et al. get around this limitation by issuing a PCIe read from the NIC

to GPU memory which flushes the previous NIC writes to the GPU and guarantees memory ordering [25]. Since version 11.3, CUDA also offers the *cudaDeviceFlushGPUDirectRDMAWrites()* API which can be used to enforce consistency similarly [31, 73]. While useful, CUDA still relies on the CPU to enforce GPU-NIC consistency. AMD, on the other hand, has explicitly corrected the GPU-NIC consistency issues in the context of device-side communication from persistent kernels and integrated the proposed fixes into ROC\_SHMEM [38]. We further discuss this issue in the context of CPU-free networking in Section 5.3.

3.5.3 Enabling Triggering Capability in GPU-Centric Communication. In type (3) (GPUDirect/ROCn RDMA) introduced in Section 2, the CPU is still responsible for the initial configuration of the system, the data transfer preparation and initiating transfers. The first phase includes setting up network interfaces and loading GPU drivers. The CPU registers GPU memory with the RDMA-capable NIC. This registration by host allows the NIC to directly access GPU memory, bypassing the need for intermediary CPU steps during data transfer. In phase two, the CPU allocates GPU memory buffers and ensures they are aligned correctly. These buffers will be used for efficient data transfer to and from the GPU. Then, the CPU sets up GPU streams and events that manage and sequence the data transfers and, guarantees the compilation of the work. Streams are used to queue operations, ensuring they are executed in the correct order. However, for truly low-latency applications, this cost may still represent a bottleneck as the mechanism relies on several synchronization points through the streams [63, 64, 100].

GPU-trigger communication in type ④ and ⑤ defined in Section 2 facilitates the offloading of both computation and communication control paths to the GPU by removing the synchronization cost described above. Here, triggered operations play a crucial role, as they are special tasks scheduled to execute only when specific conditions are met. In the stream-triggered (ST) strategy, these operations manage data transfers and synchronization through the GPU control processor (GPU CP), thereby reducing CPU involvement. Deferred Execution is another essential aspect, where the CPU creates command descriptors with deferred execution semantics and appends them to the NIC command queue. These descriptors are executed when the conditions specified by the GPU control operations are fulfilled. For example, in the HPE Slingshot 11 NIC supports these deferred operations [63]¹, including sending and receiving communications that are triggered when a hardware counter reaches a given threshold. This is possible by enabling specific command queues (e.g., Libfabric Deferred Work Queues). The synchronization between the GPU control processor and the NIC ensures the successful completion of communication operations through a special mechanism (e.g., in NVIDIA DOCA this is implemented with GPUNetIO semaphores [16]).

## 4 GPU-CENTRIC COMMUNICATION LIBRARIES

We now discuss the main GPU-centric communication libraries namely, GPU-aware MPI, GPU-centric Collectives and GPU-centric OpenSHMEM that have sprouted in recent years to ease multi-GPU programming.

## 4.1 GPU-Aware MPI

Given that MPI is the de facto *lingua franca* of HPC, much effort has gone into making MPI communication interoperable with existing GPU programming models. This work has culminated in *GPU-Aware MPI*, typically defined as *MPI implementations that can differentiate between host and device buffers*. Prior to *GPU-Aware MPI*, all multi-GPU communication had to be staged through the host incurring a device  $\rightarrow$  host copy on the source GPU and a host  $\rightarrow$  device copy on the target GPU. Using a *GPU-Aware MPI* implementation, on the other hand, a programmer can

 $<sup>^1</sup>$ 2023. Libfabric Deferred Work Queue, https://ofiwg.github.io/libfabric/v1.9.1/man/fi\_trigger.3.html Manuscript submitted to ACM

supply device buffers as parameters to the MPI call allowing communication to use the direct GPU-to-GPU data path established by GPUDirect RDMA or ROCnRDMA. In the process, GPU-awareness eliminates redundant host  $\leftrightarrow$  device copies and simplifies the communication code by eliding the need for host buffers.

MVAPICH2 was the first MPI implementation to begin actively integrating GPU-awareness into its runtime. Early work done prior to the introduction of GPUDirect RDMA added basic GPU-awareness whereby users could initiate MPI calls with buffers residing on the GPU. The resulting communication still had to be staged through the host, but this was done transparently by the library and optimized using pipelining schemes for the host  $\leftrightarrow$  device and device  $\leftrightarrow$  device transfers. These pipelining schemes were made possible by UVA, which allowed the library to differentiate between host and device pointers without relying on user hints. The ensuing GPU-awareness led to performance improvements over the GPU-oblivious version [123-125]. A follow-up work used CUDA IPC to optimize intra-node transfers, which prior had to be staged through buffers in host memory [98]. Eventually, support was added for GPUDirect RDMA over the rendezvous protocol allowing transfers to bypass the host and eliminate redundant host  $\leftrightarrow$  device copies. This reduced latencies; however, bandwidth was limited due to existing architectural limitations [96]. A subsequent work added support for GPUDirect RDMA over the eager protocol rectifying the bandwidth limitation and further reducing latencies. Additionally, a new loopback mechanism and an early version of GDRCopy [76] were used to eliminate expensive host ↔ device cudaMemcpys [110]. GDRCopy allows GPU memory to be mapped to the user address space which is optimized for small message sizes with minimal overhead [31, 76, 104]. Another work extended point-to-point MPI calls to support GPUDirect Async allowing the GPU to progress the communication enqueued by the CPU, thus, optimizing the control path [119]. Other works have also increasingly focused on adding UVM-awareness to MVAPICH2-GDR [13, 37, 56].

While MVAPICH2 certainly paved the way for GPU-aware MPI, other leading MPI implementations have also integrated support for GPU-awareness. OpenMPI [89, 128], HPE Cray MPICH [40], MVAPICH2 and IBM Spectrum MPI [61] natively support CUDA-awareness. OpenMPI also supports CUDA through UCX. Given the wider deployment of NVIDIA GPUs, the literature on ROCm-aware MPI is sparse. Existing MPI implementations with the sole exception of MVAPICH2 rely on UCX for ROCm-awareness [6, 8, 90]. On the other hand, Khorassani et al. provide a native ROCm-aware runtime for MVAPICH2 which outperforms OpenMPI with UCX on a cluster of AMD GPUs [104].

4.1.1 Streaming Support in GPU-aware MPI. Despite the benefits of GPU-aware MPI, there are inherent challenges with MPI-GPU integration. Most importantly, there is a fundamental semantic mismatch between MPI and GPU programming models. GPUs operate on the concept of streams which are command queues that guarantee ordering among GPU operations. The GPU scheduler ensures that kernels and other operations launched on a stream execute in the order they were enqueued and do so with correct data dependencies. Since kernel launches are asynchronous and do not block the host, GPU runtimes can pipeline kernel launches and overlap the launch latencies behind kernel execution. The semantic mismatch between the MPI and GPU models is that MPI has no awareness of GPU streams. As a result, it is not possible to enqueue an MPI call on a given GPU stream or for a GPU stream to wait on the completion of a pending MPI routine. By implication, interlacing MPI calls with GPU kernels will require host-blocking synchronizations in order to maintain data correctness. For example, before initiating an MPI send, the programmer has to block the host to synchronize all streams which operate on the send buffer. Similarly, waiting on completion of pending MPI communication will also require host-blocking synchronization. In either case, these forced synchronizations impair kernel launch pipelining, prevent opportunities for overlap and force the programmer into alternating bulk phases of communication and computation [32, 130].

The semantic mismatch between MPI and the GPU ultimately stems from trying to combine a CPU-centric communication runtime that has no concept of streams with GPU-based computation, which, by design, relies on streams for concurrency, synchronization, and overlap. This fundamental issue is unlikely to be addressed by piecemeal solutions. We see two possible non-mutually exclusive paths for resolving the semantic mismatch. The first is making MPI runtimes stream-aware by adding an explicit stream parameter to MPI routines. This would solve the issue of impaired kernel launch pipelining and allow MPI calls to seamlessly integrate into GPU runtimes. The second is providing the option of device-initiated MPI calls. This would reduce the programmer burden of juggling two distinct programming models and, additionally, provide implicit communication-computation overlap. Both directions have been explored in the literature on a limited scale. The FLAT compiler automatically converts device-side MPI calls to their host-side equivalents [59]. dCUDA implements device-side operations with MPI semantics but uses CPU helper threads for the actual communication. They rely on the GPU's inherent memory latency hiding capabilities to implicitly overlap communication with computation, ultimately outperforming a GPU-aware MPI baseline [36]. Namashivayam et al. explore new communication schemes to introduce GPU stream-awareness in MPI. They use the triggered operations feature on HPE Slingshot 11 interconnect, allowing the CPU to enqueue communication and synchronization operations to the NIC, which the GPU can then trigger. This reduces CPU involvement in the critical path and eliminates expensive synchronizations. While inter-node experiments show some performance benefits, the proposed scheme falters in intra-node setups as progress threads need to be used to emulate deferred execution semantics [64]. A follow-up work eliminates progress threads for intra-node communication, opting to use P2P Direct Load Store-based GPU kernels and GPU IPC-based mechanisms instead. The evaluation shows performance improvements over stream-oblivious MPI baselines [63].

### 4.2 **GPU-Centric Collectives**

As deep learning models get ever larger, their compute requirements necessitate deploying training across multiple GPUs. In line with this, many works use GPU-centric mechanisms to optimize multi-GPU training. Given the prevalence of collective communication in deep learning training, both NVIDIA and AMD provide efficient collective communication libraries.

NCCL (NVIDIA Collective Communication Library) is a software library that offers topology-aware collective primitives for inter-GPU communication [77]. Additionally, since version 2.7, NCCL also provides point-to-point communication APIs. It has been integrated as a communication backend for several state-of-the-art deep learning frameworks, including Pytorch, Tensorflow, MXNet, Caffe, CNTK, and Horovod [127]. AMD offers an analogous library called RCCL.

The implementation of GPU-Aware collectives involving computation (like reduce/allreduce) in MPI were implemented using GPU kernels for local reductions and CPU-initiated copies among the GPUs to perform aggregation. This approach incurs several kernel launch and communication call latencies and, additionally, requires intermediate buffers on the host. NCCL takes a different approach by implementing the communication and computation for the collective together in a single kernel [68]. The first version of NCCL, NCCL 1.0, only supported intra-node communication, which was implemented using *P2P Direct Load / Stores*. NCCL 2.0 added support for inter-node communication, which relies on GPUDirect RDMA and proxy threads running on the CPU [45, 46].

NCCL offers several benefits. First, NCCL has been shown to achieve high levels of bandwidth and parallel efficiency [15, 50, 68, 127]. Second, NCCL is topology-aware, thus, relieving the programmer of optimizing for the topology. Optionally the programmer can force a specific path for communication. Third, NCCL is natively compatible with the Manuscript submitted to ACM

CUDA programming model and can seamlessly integrate with it. The NCCL API calls come with a stream parameter, allowing communication-computation overlap by enqueueing the collective communication on a separate stream.

Despite its benefits, NCCL does come with its own set of inherent challenges. First, a fundamental issue of using GPU threads for communication *and* computation is that the two routines now contend for the same limited resource. In this case, if computation is scheduled ahead of the NCCL collective, it can monopolize all GPU resources, effectively serializing the collective behind the computation. One workaround around this is to launch the NCCL collective on a higher-priority stream such that it is always scheduled first [15, 68].

4.2.1 Further Work on Collectives. Even though both P2P communication and collectives play an important role in HPC, the demand from deep learning has made optimizing collective communication workloads an important and attractive area of research. Since early versions of NCCL did not support inter-node communication, hierarchical collective mechanisms are proposed to support scaling out across nodes. Subsequent work challenges NCCL's hegemony of deep learning collective communication workloads by proposing new pipelined designs for the MPI\_Bcast collective. The proposed design outperforms NCCL2 broadcast for small message sizes and offers comparable performance for larger ones [12]. Blink is a collective communication library with the express goal of achieving optimal link utilization. To do this, Blink detects the underlying topology, models the topology as a graph, and then uses a technique known as packing spanning trees to dynamically generate communication primitives. As a result, Blink was shown to reduce model training time on an image classification task by 40% compared to NCCL2 [122]. Dryden et al. present Aluminum, a GPU-aware library for large-scale training of deep neural networks. Namely, Aluminum aims to rectify the inherent drawbacks of both NCCL and MPI. Aluminum extends NCCL with tree-based algorithms to avoid the latency bottlenecks of its default ring implementation. Additionally, it adds support for non-blocking NCCL allreduce operations. For MPI, Aluminum gets around forced synchronizations stemming from the MPI-GPU semantic mismatch by associating a single GPU stream with an MPI communicator and synchronizing with respect only to that stream. The resulting optimizations bring about speedups compared to GPU-aware MPI and NCCL-based implementations [32].

Awan et al. use the newly released NCCL to implement efficient intra-node *MPI\_Bcast* as part of MVAPICH2-GDR [11]. In a recent work, Chen et al. use CUDA IPC to optimize MPI intra-node all-to-all communication. They use CUDA IPC to pass device buffers across process boundaries and then communicate using *P2P Direct Load / Stores* for small messages and *P2P DMA Copies* for larger ones [20].

Like UCX, UCC is an open-source project providing an API and library implementation of high-performance and scalable collective operations, leveraging topology-aware algorithms and techniques such as in-network computing. It relies on UCX point-to-point communication, as well as on NCCL/RCCL, SHARP, and others.

### 4.3 GPU-centric OpenSHMEM

GPU-centric OpenSHMEM runtimes are NVIDIA's NVSHMEM and AMD's ROC\_SHMEM libraries. The two libraries are fundamentally similar but have some differences. Given its earlier inception, we first discuss NVSHMEM and, in doing so, introduce concepts fundamental to both libraries. In the ROC\_SHMEM section, we discuss its important differences from NVSHMEM.

4.3.1 **NVSHMEM**. NVSHMEM is NVIDIA's implementation of the OpenSHMEM specification for CUDA devices. NVSHMEM is a Partitioned Global Adress Space (PGAS) library that provides efficient one-sided *put / get* APIs for processes to access remote data objects. NVSHMEM supports point-to-point and collective communication between GPUs both within and across nodes [80].

NVSHMEM works on the concept of a *symmetric heap*. During NVSHMEM initialization, each process that is mapped to a GPU, referred to as a processing element (PE) reserves a block of GPU memory using *nvshmem\_malloc()*. In NVSHMEM, all memory allocations must be performed collectively, meaning that all symmetric memory regions within the heap must have identical sizes and must be allocated at the same time. To access remote memory on a different PE, a given PE requires the offset for the symmetric memory as well as the rank of the remote PE.

In addition, NVSHMEM provides APIs for synchronizing a group of PEs. These APIs comprise signal-wait mechanisms that can serve as a means for point-to-point synchronization and collective synchronization calls that can function as global barriers. This feature is particularly important since there is a general lack of kernel-side global barriers, with the CPU typically performing the role of global synchronizer for devices. The capacity for a device to synchronize efficiently across the device without terminating kernel execution is a crucial prerequisite for transferring the control plane to the GPU.

A notable attribute of NVSHMEM is that it offers both host-side and device-side APIs. The host-side APIs expose an optional stream argument that can be used to implement communication-computation overlap. For certain calls, the GPU-side variants provide the calls in three granularities: thread, thread block, and warp. The thread variant means that the call should be performed by a single device thread and will be executed by that thread. The thread block and warp variants use multiple threads to execute the communication call cooperatively. These variants should be called by all threads in the corresponding thread block or warp. A previous performance comparison between the host-side and device-side APIs found negligible differences in performance between the two, with host-side APIs slightly outperforming the device-side variants [35]. This study was conducted using an early version of NVSHMEM (0.3.0), which has since seen improvements in GPU-side API performance.

As of version 2.7.0, NVSHMEM introduced the Infiniband GPUDirect Async (IBGDA) transport built on top of GPUDirect Async [81]. The IBGDA transport allows GPUs to issue inter-node communication directly to the NIC, bypassing the CPU entirely. Without IBGDA, device-side inter-node communication calls are performed through a proxy thread on the CPU that triggers the corresponding NIC operations. This proxy thread consumes CPU resources and creates a bottleneck in achieving peak NIC throughput for fine-grained transfers [91]. NVSHMEM with IBGDA support, combined with persistent kernels, enables the complete transfer of both data and control paths to the GPU and marks a significant shift towards fully autonomous multi-GPU execution. However, as discussed in Section 3.2.3, GPUDirect RDMA only enforces GPU-NIC memory consistency across kernel boundaries. This inherent reliance on the CPU for memory consistency is a potential obstacle toward truly autonomous multi-GPU execution. One workaround is using a callback mechanism whereby the persistent kernel signals the CPU to perform a consistency-enforcing API call (i.e., cudaDeviceFlushGPUDirectRDMAWrites()). The efficacy of this solution is unclear and warrants further investigation. Enforcing GPU-NIC memory from inside the kernel is supported by ROC\_SHMEM, which we discuss next.

4.3.2 ROC\_SHMEM. ROC\_SHMEM is AMD's implementation of the OpenSHMEM specification for AMD GPUs. ROC\_SHMEM offers two communication backends. The first, known as GPU-IB, implements Infiniband on the GPU, similar to NVSHMEM's IBGDA transport. The second, called Reverse Offload (RO), uses host-side proxy threads and offloads communication to the CPU. GPU-IB is the default backend and offers the best performance [10]. ROC\_SHMEM works almost identically to NVSHMEM and offers analogous APIs. However, there are several significant differences. First, as mentioned in the previous section, NVSHMEM runs into GPU-NIC memory consistency problems when intra-node communication is issued from persistent kernels. ROC\_SHMEM, on the other hand, explicitly addresses Manuscript submitted to ACM

this issue and guarantees correctness when persistent kernels are being used. Hamidouche et al. analyze the GPU-NIC memory mismatch stemming from the GPU's relaxed memory model and propose changes integrated into ROC\_SHMEM [38]. This means that ROC\_SHMEM provides completely CPU-free communication mechanism that can move the entire flow of multi-GPU execution to the device.

Second, ROC\_SHMEM uses GPU shared memory (local data store (LDS) in AMD parlance) to store network state for faster access. To the best of our knowledge, this optimization is not implemented in NVSHMEM. While this is most likely beneficial for execution time, the increased shared memory message could limit occupancy and negatively impact performance [4, 99].

Third, prior versions of ROC\_SHMEM required allocating symmetric buffers as uncacheable in order to prevent stale data from being communicated. However, as AMD recently introduced intra-kernel cache flush instructions, the data can be flushed before initiating the network transaction, allowing the data to be cached. No such instructions are provided by NVIDIA, meaning that NVSHMEM buffers are likely allocated as uncacheable [99].

4.3.3 Applications. In recent years, NVSHMEM has been integrated as a communication backend into multiple runtimes. PETSc implemented PetscSF, a scalable communication layer based on NVSHMEM, to complement their MPI-based approach, which did not work well with CUDA stream semantics and prevented kernel launch pipelining [130]. Kokkos Remote Spaces, adding distributed memory support to the Kokkos programming model, uses NVSHMEM as a communication backend [26, 118]. An NVSHMEM implementation of the Kokkos Conjugate Gradient Solver outperforms the CUDA-aware MPI implementation while significantly reducing the code base size [57]. Choi et al. use persistent kernels and NVSHMEM to implement CharminG, a fully GPU-resident runtime system inspired by Charm++ [24]. The Livermore Big Artificial Neural Network (LBANN) implements a spatial-parallel convolution using NVSHMEM that outperforms MPI and Aluminium implementations [57]. QUDA, a library for lattice QCD computations, has used NVSHMEM and persistent kernels for improved strong scaling of Dirac operators [52, 121].

NVSHMEM has also been used outside of runtime-based approaches to achieve performance improvements. Chu et al. combine NVSHMEM with persistent kernels to implement a state-of-the-art GPU-based key-value store [25]. Xie et al. use NVSHMEM to implement a single-node multi-GPU sparse triangular solver (SpTRSV) that achieves good performance scalability compared to a UVM-based design [129]. Ding et al. combine persistent kernels with NVSHMEM for impressive performance in a sparse triangular solver (SpTRSV) on single- and multi-node setups. Atos implements both persistent and discrete kernels with NVSHMEM-based communication to achieve state-of-the-art performance on multi-GPU BFS within and across nodes [21]. Wang et al. propose MGG, a system design that accelerates Graph Neural Networks (GNNs) on multi-GPU systems using a GPU-centric software communication-computation pipeline with NVSHMEM for fine-grained communication [126]. Ismayilov et al. use persistent kernels and device-side NVSHMEM to implement fully GPU-side Jacobi 2D/3D and CG solvers that outperform CPU-controlled baselines. They reserve some thread blocks for communication while others handle computation, a technique they call *thread block specialization*, to achieve *explicit* device-side communication-computation overlap [42]. Punniyamurthy et al. use ROC\_SHMEM and persistent kernels to overlap embedding operations with collective communication in deep learning recommendation models [99].

## 5 DISCUSSION, CHALLENGES, AND OUTLOOK

As multi-GPU execution becomes a necessity for many real-world workloads, we dedicate this section to discussions and outlooks, presenting what we believe to be fertile ground for current and future research on GPU-centric communication.

#### 5.1 Moving Away from CPU

There are several reasons why we believe moving away from CPU-controlled execution shows promise. First, it addresses the issue of CPU-induced latency barriers that are caused by kernel launch and memory copy overheads. These barriers become more significant in strong scaling scenarios as the number of GPUs increases and computation per GPU decreases. In latency-bound settings, traditional CPU-controlled implementations are not able to overlap communication with computation as the latencies to initiate the operations take longer than the operations themselves effectively serializing communication and computation. On the other hand, CPU-free execution can still achieve adequate levels of overlap even when latencies dominate [42, 121]. Second, the parallelism offered by within-kernel communication is well-suited for persistent kernels to take advantage of. With the use of efficient communication and synchronization APIs, this execution model can achieve higher bandwidths and lower latencies than CPU-controlled implementations. Additionally, CPU-free execution, by virtue of inlining communication with computation, is well-suited for applications with fine-grained communication [21]. Third, ROC\_SHMEM allows for the first time to fully migrate application execution to the device with zero reliance on CPU helper threads. NVSHMEM with its IBGDA transport can also migrate a significant amount of execution to GPU but must still rely on the CPU for functional correctness.

However, there are several challenges facing this model of execution. One major challenge is that persistent kernels can result in reduced occupancy potentially bottlenecking computation. As of today, if global device or multi-GPU barriers are required, persistent kernels must be launched in a *cooperative* manner. This means that only as many threads can be launched as can run concurrently at the same time making hardware oversubscription impossible. As a result, workload decomposition and scheduling, which were previously handled by the hardware scheduler, now need to be manually done by the programmer. This manual approach is unlikely to be as efficient as hardware-based scheduling, and compute-intensive applications are likely to suffer. Furthermore, long-running persistent kernels will consume more registers and may use shared memory as well limiting occupancy even further. Nevertheless, we see a two-fold solution to this problem. First, there is a large amount of high bandwidth shared memory available across application execution, which can potentially nullify the performance hit caused by reduced occupancy. Second, we predict that as GPU vendors strive more and more for greater GPU autonomy, they will introduce APIs that allow for combining persistent kernels with hardware oversubscription. The manual decomposition can also be handled by an optimized compiler / runtime system.

Another potential problem with NVSHMEM and ROC\_SHMEM is their ease of integration into existing runtimes. Both libraries center around a *symmetric heap* and all communication buffers must be allocated collectively on the same symmetric heap by all GPUs. This symmetric allocation requires library-specific allocators. Existing runtimes may find it hard to add support for NVSHMEM and ROC\_SHMEM because of the symmetric memory allocation requirement.

#### 5.2 UCX as a Potential Pathway for GPU-Awareness

Unified Communication X (UCX) is an open-source communication framework that abstracts over several network APIs, programming models, protocols, and implementations. The idea is to provide a set of high-level primitives while hiding the low-level implementation details behind the UCX runtime. The relevance of UCX for GPU-centric communication is that its tagged and stream APIs can be used to implement a *GPU-centric* communication layer for both ROCm and CUDA. The actual GPU-centric communication then uses the corresponding native libraries [108, 116].

Using UCX for realizing *GPU-centric* communication is a recent direction in the literature that has started to take hold. Perhaps the most relevant example is that of *ROCm-awareness* for MPI implementations. Much early work on *GPU-aware* MPI was done for NVIDIA GPUs using native CUDA libraries and then integrated directly into MPI runtimes. Perhaps reticent to replicate the same work to make their runtimes ROCm-aware, most MPI implementations provide ROCm-awareness only through UCX. OpenMPI additionally provides CUDA support through UCX, besides its native integration. In non-MPI work, Choi et al. extend the UCX layer in Charm++ to provide GPU-aware communication for several programming models in the Charm++ ecosystem [22, 23].

We predict that more runtimes will gravitate toward UCX to add support for GPU-centric communication. Using UCX APIs frees the programmers from relying on native vendor-specific APIs and allows adding GPU-aware communication for both ROCm and CUDA. However, there are some caveats. One is that it needs to be clarified if GPU-aware UCX can overlap communication with computation. The second caveat is that the increased generality and convenience may trade off performance. A performance comparison between GPU-awareness through native APIs and that provided through UCX would be helpful. In one work in this direction, Khorassani et al. provide a native ROCm-aware runtime for MVAPICH2, which outperforms OpenMPI with UCX on a cluster of AMD GPUs [104]. However, the performance difference may be due to differences in the MPI implementations, not UCX. Another current limitation of UCX is that it is always on the host, and it does not use any communication kernel, but relies on the traditional *cudamemcpy* family of functions in addition to zero copy RDMA.

### 5.3 CPU-Free Networking

As the trend toward GPU-centric communication and greater GPU autonomy continues to accelerate, several works have suggested migrating most or all of the networking stack to the kernel. This is typically done by launching a single long-running persistent kernel and moving both the data and control paths to the GPU.

In the earliest work, GGAS [85] proposes changes to network devices to implement a unified global address space that allows moving the control path entirely to the GPU. This is accomplished by using a persistent kernel that contains the computation, communication, and synchronization all on the device-side. While the work was the first of its kind and showed performance improvements compared to a CUDA-Aware MPI baseline, the experiments were conducted on two GPU nodes with one GPU each, while the proposed hardware changes were emulated on an FPGA. Follow-up work showed that GGAS, by virtue of eliminating CPU involvement in the control path, can achieve further performance improvements and reduce energy usage compared to CPU-controlled baselines [48, 87].

Several more works on CPU-free networking followed. Oden et al. [86] use GPUDirect RDMA to allow GPUs to directly interface with Infiniband network devices without the involvement of the host CPU. They do this by mapping the entire Infiniband context to the device-side and using the GPU to generate and send work requests to the HCA. However, because of slow single-thread work request generation performance on the GPU, the proposed changes deteriorated performance compared to CPU-controlled baselines. Follow-up work ameliorated these performance limitations and showed much more promising results [47, 49]. Another work combines the proposed GPU-side Infiniband Verbs with CUDA Dynamic Parallelism to optimize the bottleneck of intra-kernel synchronization [88]. GPUrdma also implements Infiniband on the GPU and proposes a GPU-side library for direct communication from within persistent GPU kernels with zero CPU involvement. The proposed design outperforms a CPU-controlled baseline on a series of microbenchmarks but runs into correctness issues resulting from the use of persistent kernels and GPU-NIC interaction [30].

Silberstein et al. implement GPUNet, which provides GPU-side socket abstractions and networking primitives [113]. GPUNet allows invoking the communication on the GPU but does not fully migrate the control path to the device;

Manuscript submitted to ACM

instead, it relies on CPU helper threads to perform the actual communication. A similar approach is adopted by dCUDA, which provides device-side APIs with MPI semantics but translates them to standard MPI calls performed by CPU helper threads [36]. LeBeane et al. categorized GPU networking methods discussing at length their deficiencies. In response, they propose GPU-TN, a NIC hardware mechanism that allows the CPU to create and register messages with the NIC and the GPU to trigger them from a running persistent kernel [53]. Another work, ComP-Net, uses embedded GPU microprocessors to offload helper threads from the CPU to the GPU [54]. While both GPU-TN and ComP-Net show promising performance, they require hardware changes to the NIC and the GPU and, thus, rely on simulation to obtain results.

## 5.4 Broader GPU Autonomy

The recent proliferation of *GPU-centric* communication represents the general trend toward broader GPU autonomy. Several works, early and recent, have tried to hand the GPU the reins of domains that have traditionally been the purview of the CPU. In an early work, Stuart et al. propose methods that allow the GPU to issue callbacks to the CPU [114]. Silberstein et al. implement GPUfs, which allows the GPU to request files on the host CPU directly from inside a GPU kernel [112]. Veselý et al. implement support for invoking POSIX system calls from inside GPU kernels through changes to the Linux kernel [120]. NVIDIA's GPUDirect Storage provides a direct data path between GPUs and storage but still relies on the CPU to orchestrate execution [117]. In a recent work by NVIDIA, Qureshi et al. present BaM, the first approach that allows GPUs to directly access storage without any CPU involvement. Experimental results show that BaM outperforms GPUDirect Storage on several workloads [101].

These and other works show a clear trend toward general GPU autonomy. In line with this, we expect further optimizations to GPU-centric communication. Several recent mechanisms are promising, as pointed out by Punniyamurthy et al. [99]. First, the recent thread block(TB) cluster abstraction could benefit device-side communication-computation overlap and inter-TB synchronization. Second, AMD's recent cache flush instructions allow flushing the cache before initiating network communication, meaning communication no longer needs to be allocated as uncacheable. Third, recent hardware trends like fatter GPU nodes and tight GPU-NIC integration are also promising [5, 33]. For example, the most recent iteration of NVSwitch directly connects 256 Grace Hopper Superchips enabling direct P2P all-to-all communication at an unprecedented scale.

### 5.5 Collective Algorithms Design

Multi-GPU systems present unprecedented challenges in the design of collective operations algorithms. First, GPUs within the same node are often arranged in complex and non-traditional topologies, making existing collective algorithms less effective. Second, there is significant heterogeneity in the bandwidth between GPU pairs, both within a node and between nodes. For instance, there can be up to a 4x difference in bandwidth among different GPU pairs on the same node [29], and up to a 10x gap between intra-node and inter-node bandwidth. Third, these topological and connectivity characteristics frequently change between GPU generations. These factors complicate the design of collective algorithms, potentially leading to underutilization of available bandwidth and poor performance of collective operations.

Some approaches use linear programming formulations to find the optimal collective algorithm based on a network specification and the size of the collective [19, 106, 122]. However, this involves solving an NP-hard problem that scales exponentially. For example, finding a solution for 128 nodes can take up to 11 hours [106], and a new solution might be needed if the number of nodes or the size of the collective changes. This complexity makes generating collective algorithms for large systems challenging, if not impossible. To simplify the implementation of multi-GPU collective Manuscript submitted to ACM

operations, MSCCL [28] provides a high-level language for expressing collective operations, which is then compiled to NCCL code.

## 5.6 Debugging and Profiling Support

Efficient programming tools are essential for productive multi-GPU programming. However, the available tools are severely lacking when it comes to communication native to the GPU. While NVIDIA's flagship system-level profiling tool, NSight Systems, provides a detailed view of host-controlled communication, it falls short in providing information on device-native communication, including Direct Load/Store P2P communication and communication induced by libraries such as NCCL and NVSHMEM.

Palwisha et al. present ComScribe [3], a tool that allows monitoring NCCL collective and P2P communication. However, it is limited to a single node and relies on the deprecated *nvprof* tool. New efforts in this area have been made by Snoopie [43], which focuses on profiling and visualizing GPU-centric communication. Snoopie is capable of attributing communication to source code lines and objects involved in communication, offering different levels of granularity. This allows for both a coarse-grained overview of the system and a detailed view of a specific object or device in terms of data movement.

Another class of tools that can help with debugging GPU communication are race detectors. Given that many of the aforementioned communication libraries use a partitioned global address space model compared with message passing, there is a high likelihood of introducing race hazards. Race detector tools are vital to navigate the complexities of shared data in multi-GPU programming. Despite their importance, none of the existing tools are capable of detecting race hazards in multi-GPU programming. Compute Sanitizer's Racecheck tool [65] is limited to on-chip shared memory, supporting the detection of race hazards only within a single GPU context. HiRace [44] improves on that by supporting global memory and can detect many types of race hazards that Compute Sanitizer's Racecheck cannot, but it is also limited to a single GPU context.

We believe that introducing debugging and profiling tools capable of detecting fine-grained device-native transfers both within and across nodes, in addition to race detectors capable of capturing race hazards in a multi-GPU context, is crucial for further advancements in GPU-centric communication.

## 6 CONCLUSION

Traditionally, multi-GPU communication was managed by the CPU, but recent advancements in GPU-centric communication have begun shifting this responsibility, allowing GPUs to take more control over the communication tasks. This paper provides an in-depth exploration of GPU-centric communication, emphasizing vendor-supplied mechanisms and user-level library support. It seeks to demystify the complexities and variety of options in this area, define key terms, and categorize the various approaches used within individual nodes and across multiple nodes.

The discussion includes an analysis of vendor-provided communication and memory management mechanisms for multi-GPU execution, as well as a review of major communication libraries such as CUDA-aware MPI, NCCL/RCCL, NVSHMEM, and ROC\_SHMEM, highlighting their advantages, challenges, and performance considerations. Furthermore, the paper presents important research paradigms such as CPU-free networking, debugging tools for communication, and discusses future directions, and unresolved questions. By thoroughly examining GPU-centric communication techniques across both software and hardware layers, this paper aims to equip researchers, developers, engineers, and library designers with the knowledge needed to fully leverage multi-GPU systems.

#### **ACKNOWLEDGEMENT**

Authors at Koç University were supported from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 949587).

#### **REFERENCES**

- Elena Agostini, Davide Rossetti, and Sreeram Potluri. 2017. Offloading Communication Control Logic in GPU Accelerated Applications. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (Madrid, Spain) (CCGrid '17). Institute for Electrical and Electronics Engineers, New York, NY, USA, 248–257. https://doi.org/10.1109/CCGRID.2017.29
- [2] E. Agostini, D. Rossetti, and S. Potluri. 2018. GPUDirect Async: Exploring GPU synchronous communication techniques for InfiniBand clusters. J. Parallel and Distrib. Comput. 114 (2018), 28–45. https://doi.org/10.1016/j.jpdc.2017.12.007
- [3] Palwisha Akhtar, Erhan Tezcan, Fareed Mohammad Qararyah, and Didem Unat. 2021. ComScribe: Identifying Intra-node GPU Communication. In Benchmarking, Measuring, and Optimizing, Felix Wolf and Wanling Gao (Eds.). Springer International Publishing, Cham, 157–174.
- [4] AMD. [n.d.]. AMD Instinct MI200 Instruction Set Architecture. https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf.
- [5] AMD. 2021. AMD CDNA<sup>TM</sup> 2 ARCHITECTURE. https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf.
- [6] AMD. 2023. GPU-aware MPI with ROCm. https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-gpu-aware-mpi-readme/#.
- [7] AMD. 2023. ROCK-Kernel-Driver. https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver.
- [8] AMD. 2023. ROCm Documentation: GPU-Enabled MPI. https://rocm.docs.amd.com/en/latest/how\_to/gpu\_aware\_mpi.html.
- [9] AMD. 2023. ROCnRDMA. https://github.com/rocmarchive/ROCnRDMA.
- [10] AMD. 2023. ROC\_SHMEM. https://github.com/ROCm-Developer-Tools/ROC\_SHMEM.
- [11] A. A. Awan, K. Hamidouche, A. Venkatesh, and D. K. Panda. 2016. Efficient Large Message Broadcast Using NCCL and CUDA-Aware MPI for Deep Learning. In Proceedings of the 23rd European MPI Users' Group Meeting (Edinburgh, United Kingdom) (EuroMPI 2016). Association for Computing Machinery, New York, NY, USA, 15–22. https://doi.org/10.1145/2966884.2966912
- [12] Ammar Ahmad Awan, Karthik Vadambacheri Manian, Ching-Hsiang Chu, Hari Subramoni, and Dhabaleswar K. Panda. 2019. Optimized large-message broadcast for deep learning workloads: MPI, MPI+NCCL, or NCCL2? Parallel Comput. 85 (2019), 141–152. https://doi.org/10.1016/j.parco. 2019.03.005
- [13] Dip Sankar Banerjee, Khaled Hamidouche, and Dhabaleswar K. Panda. 2016. Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences (GPGPU '16). Association for Computing Machinery, New York, NY, USA, 82–91. https://doi.org/10.1145/ 2884045 2884050
- [14] Tal Ben-Nun, Michael Sutton, Sreepathi Pai, and Keshav Pingali. 2020. Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-Scale Graph Processing. ACM Trans. Parallel Comput. 7, 3, Article 18 (jun 2020), 27 pages. https://doi.org/10.1145/3399730
- [15] Massimo Bernaschi, Elena Agostini, and Davide Rossetti. 2021. Benchmarking multi-GPU applications on modern multi-GPU integrated systems. Concurrency and Computation: Practice and Experience 33, 14 (2021), e5470. https://doi.org/10.1002/cpe.5470 arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpe.5470
- [16] Stephen Brosky. 2023. Inline GPU Packet Processing with NVIDIA DOCA GPUNetIO. https://developer.nvidia.com/blog/inline-gpu-packet-processing-with-nvidia-doca-gpunetio/.
- [17] Stephen Brosky. 2023. Optimizing Inline Packet Processing Using DPDK and GPUDirect with GPUs. https://developer.nvidia.com/blog/optimizing-inline-packet-processing-using-dpdk-and-gpudev-with-gpus/.
- [18] Idan Burstein. 2021. Nvidia Data Center Processing Unit (DPU) Architecture. In 2021 IEEE Hot Chips 33 Symposium (HCS). 1–20. https://doi.org/10.1109/HCS52781.2021.9567066
- [19] Zixian Cai, Zhengyang Liu, Saeed Maleki, Madanlal Musuvathi, Todd Mytkowicz, Jacob Nelson, and Olli Saarikivi. 2021. Synthesizing Optimal Collective Algorithms. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Virtual Event, Republic of Korea) (PPoPP '21). Association for Computing Machinery, New York, NY, USA, 62–75. https://doi.org/10.1145/3437801.3441620
- [20] Chen-Chun Chen, Kawthar Shafie Khorassani, Quentin G. Anthony, Aamir Shafi, Hari Subramoni, and Dhabaleswar K. Panda. 2022. Highly Efficient Alltoall and Alltoallv Communication Algorithms for GPU Systems. In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 24–33. https://doi.org/10.1109/IPDPSW55747.2022.00014
- [21] Yuxin Chen, Benjamin Brock, Serban Porumbescu, Aydın Buluç, Katherine Yelick, and John D. Owens. 2022. Scalable Irregular Parallelism with GPUs: Getting CPUs out of the Way. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (Dallas, Texas) (SC '22). Institute for Electrical and Electronics Engineers, New York, NY, USA, Article 50, 16 pages.
- [22] Jaemin Choi, Zane Fink, Sam White, Nitin Bhat, David F. Richards, and Laxmikant V. Kale. 2021. GPU-aware Communication with UCX in Parallel Programming Models: Charm++, MPI, and Python. In 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 479–488. https://doi.org/10.1109/IPDPSW52791.2021.00079
- [23] Jaemin Choi, Zane Fink, Sam White, Nitin Bhat, David F. Richards, and Laxmikant V. Kale. 2022. Accelerating communication for parallel programming models on GPU systems. Parallel Comput. 113 (2022), 102969. https://doi.org/10.1016/j.parco.2022.102969

- [24] Jaemin Choi, David F. Richards, and Laxmikant V. Kale. 2021. CharminG: A Scalable GPU-Resident Runtime System. In Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing (Virtual Event, Sweden) (HPDC '21). Association for Computing Machinery, New York, NY, USA, 261–262. https://doi.org/10.1145/3431379.3464454
- [25] Ching-Hsiang Chu, Sreeram Potluri, Anshuman Goswami, Manjunath Gorentla Venkata, Neena Imam, and Chris J. Newburn. 2019. Designing High-Performance In-Memory Key-Value Operations with Persistent GPU Kernels and OpenSHMEM. In OpenSHMEM and Related Technologies. OpenSHMEM in the Era of Extreme Heterogeneity, Swaroop Pophale, Neena Imam, Ferrol Aderholdt, and Manjunath Gorentla Venkata (Eds.). Springer International Publishing, Cham, 148–164.
- [26] Jan Ciesko. 2023. Kokkos Remote Spaces Repository. https://github.com/kokkos/kokkos-remote-spaces.
- [27] NVIDIA Corporation. 2023. NVIDIA DOCA SDK Documentation. https://docs.nvidia.com/doca/sdk/index.html.
- [28] Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi, and Yifan Xiong. 2023. MSCCLang: Microsoft Collective Communication Language. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (Vancouver, BC, Canada) (ASPLOS 2023). Association for Computing Machinery, New York, NY, USA, 502-514. https://doi.org/10.1145/ 3575693.3575724
- [29] CSC. 2024. LUMI-G Supercomputer. https://docs.lumi-supercomputer.eu/hardware/lumig/.
- [30] Feras Daoud, Amir Watad, and Mark Silberstein. 2016. GPUrdma: GPU-Side Library for High Performance Networking from GPU Kernels (ROSS '16). Association for Computing Machinery, New York, NY, USA, Article 6, 8 pages. https://doi.org/10.1145/2931088.2931091
- [31] Seth Howell Davide Rossetti, Pak Markthub. 2021. The Latest in GPUDirect. https://www.nvidia.com/en-us/on-demand/session/gtcspring21s32039/.
- [32] Nikoli Dryden, Naoya Maruyama, Tim Moon, Tom Benson, Andy Yoo, Marc Snir, and Brian Van Essen. 2018. Aluminum: An Asynchronous, GPU-Aware Communication Library Optimized for Large-Scale Training of Deep Neural Networks on HPC Systems. In 2018 IEEE/ACM Machine Learning in HPC Environments (MLHPC). 1–13. https://doi.org/10.1109/MLHPC.2018.8638639
- [33] Jonathon Evans, Michael Andersch, Vikram Sethi, Gonzalo Brito, and Vishal Mehta. 2022. NVIDIA Grace Hopper Superchip Architecture In-Depth. https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/.
- [34] Denis Foley and John Danskin. 2017. Ultra-Performance Pascal GPU and NVLink Interconnect. IEEE Micro 37, 2 (2017), 7–17. https://doi.org/10. 1109/MM.2017.37
- [35] Taylor Groves, Ben Brock, Yuxin Chen, Khaled Z. Ibrahim, Lenny Oliker, Nicholas J. Wright, Samuel Williams, and Katherine Yelick. 2020.
  Performance Trade-offs in GPU Communication: A Study of Host and Device-initiated Approaches. In 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS). 126–137. https://doi.org/10.1109/PMBS51919.2020.00016
- [36] Tobias Gysi, Jeremia Bär, and Torsten Hoefler. 2016. dCUDA: Hardware Supported Overlap of Computation and Communication. In SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 609–620. https://doi.org/10.1109/SC. 2016.51
- [37] Khaled Hamidouche, Ammar Ahmad Awan, Akshay Venkatesh, and Dhabaleswar K. Panda. 2016. CUDA M3: Designing Efficient CUDA Managed Memory-Aware MPI by Exploiting GDR and IPC. In 2016 IEEE 23rd International Conference on High Performance Computing (HiPC). 52–61. https://doi.org/10.1109/HiPC.2016.016
- [38] Khaled Hamidouche and Michael LeBeane. 2020. GPU INitiated OPenSHMEM: Correct and Efficient Intra-Kernel Networking for DGPUs. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (San Diego, California) (PPoPP '20). Association for Computing Machinery, New York, NY, USA, 336–347. https://doi.org/10.1145/3332466.3374544
- [39] Mark Harris. 2012. How to Optimize Data Transfers in CUDA C/C++. https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/.
- [40] HPE. 2021. Cray MPICH Documentation. https://cpe.ext.hpe.com/docs/mpt/mpich/intro\_mpi.html.
- [41] Alexaner Ishii and Ryan Wells. 2023. The NVLink-Network Switch: NVIDIA's Switch Chip for High Communication-Bandwidth Superpods. https://hc34.hotchips.org/assets/program/conference/day2/Network%20and%20Switches/NVSwitch%20HotChips%202022%20r5.pdf.
- [42] Ismayil Ismayilov, Javid Baydamirli, Doğan Sağbili, Mohamed Wahib, and Didem Unat. 2023. Multi-GPU Communication Schemes for Iterative Solvers: When CPUs Are Not in Charge. In Proceedings of the 37th International Conference on Supercomputing (Orlando, FL, USA) (ICS '23). Association for Computing Machinery, New York, NY, USA, 192–202. https://doi.org/10.1145/3577193.3593713
- [43] Mohammad Kefah Taha Issa, Muhammad Aditya Sasongko, Ilyas Turimbetov, Javid Baydamirli, Doğan Sağbili, and Didem Unat. 2024. Snoopie: A Multi-GPU Communication Profiler and Visualizer. In Proceedings of the 38th ACM International Conference on Supercomputing (Kyoto, Japan) (ICS '24). Association for Computing Machinery, New York, NY, USA, 525–536. https://doi.org/10.1145/3650200.3656597
- [44] John Jacobson, Martin Burtscher, and Ganesh Gopalakrishnan. 2024. HiRace: Accurate and Fast Source-Level Race Checking of GPU Programs. arXiv:2401.04701 [cs.DC]
- [46] Sylvain Jeaugey. 2019. Distributed Neural Network Training: NCCL On Summit. https://www.olcf.ornl.gov/wp-content/uploads/2019/12/Summit-NCCL.pdf.
- [47] Benjamin Klenk, Lena Oden, and Holger Froening. 2014. Analyzing Put/Get APIs for Thread-Collaborative Processors. In 2014 43rd International Conference on Parallel Processing Workshops. 411–418. https://doi.org/10.1109/ICPPW.2014.61
- [48] Benjamin Klenk, Lena Oden, and Holger Fröning. 2014. GPU-centric communication for improved efficiency. In International Workshop on Green Programming, Computing and Data Processing (GPCDP) in conjunction with International Green Computing Conference (IGCC), Dallas, TX, USA.

[49] Benjamin Klenk, Lena Oden, and Holger Froning. 2015. Analyzing communication models for distributed thread-collaborative processors in terms of energy and time. In 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 318–327. https://doi.org/10.1109/ISPASS.2015.7095817

- [50] Jiri Kraus. 2021. Multi-GPU Programming Models. https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31050/.
- [51] Akhil Langer and Jim Dinan. 2021. NVSHMEM: GPU-Integrated Communication for NVIDIA GPU Clusters. https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s32515/.
- [52] lattice. 2023. QUDA Repository. https://github.com/lattice/quda.
- [53] Michael LeBeane, Khaled Hamidouche, Brad Benton, Mauricio Breternitz, Steven K. Reinhardt, and Lizy K. John. 2017. GPU Triggered Networking for Intra-Kernel Communications (SC '17). Association for Computing Machinery, New York, NY, USA, Article 22, 12 pages. https://doi.org/10. 1145/3126908.3126950
- [54] Michael LeBeane, Khaled Hamidouche, Brad Benton, Mauricio Breternitz, Steven K. Reinhardt, and Lizy K. John. 2018. ComP-Net: Command Processor Networking for Efficient Intra-Kernel Communications on GPUs. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (Limassol, Cyprus) (PACT '18). Association for Computing Machinery, New York, NY, USA, Article 29, 13 pages. https://doi.org/10.1145/3243176.3243179
- [55] Ang Li, Shuaiwen Leon Song, Jieyang Chen, Jiajia Li, Xu Liu, Nathan R. Tallent, and Kevin J. Barker. 2020. Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect. IEEE Trans. Parallel Distrib. Syst. 31, 1 (jan 2020), 94–110. https://doi.org/10.1109/TPDS.2019.2928289
- [56] K. V. Manian, A. A. Ammar, A. Ruhela, C.-H. Chu, H. Subramoni, and D. K. Panda. 2019. Characterizing CUDA Unified Memory (UM)-Aware MPI Designs on Modern GPU Architectures. In Proceedings of the 12th Workshop on General Purpose Processing Using GPUs (Providence, RI, USA) (GPGPU '19). Association for Computing Machinery, New York, NY, USA, 43–52. https://doi.org/10.1145/3300053.3319419
- [57] Naoya Maruyama, Brian Van Essen, Jan Ciesko, Jeremiah Wilke, Christian Trott, Chung-Hsing Hsu, Neena Imam, Jim Dinan, Akhil Langer, CJ Newburn, and Sreeram Potluri. 2020. Scaling Scientific Computing with NVSHMEM. https://developer.nvidia.com/blog/scaling-scientific-computing-with-nvshmem/.
- [58] Hans Meuer, Erich Strohmaier, Jack Dongarra, Horst Simon, and Martin Meuer. 2023. Top 500. https://www.top500.org/. Accessed: 2023-07-29.
- [59] Takefumi Miyoshi, Hidetsugu Irie, Keigo Shima, Hiroki Honda, Masaaki Kondo, and Tsutomu Yoshinaga. 2012. FLAT: A GPU Programming Framework to Provide Embedded MPI. In Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units (London, United Kingdom) (GPGPU-5). Association for Computing Machinery, New York, NY, USA, 20–29. https://doi.org/10.1145/2159430.2159433
- [60] Timothy Prickett Morgan. 2024. Key Hyperscalers And Chip Makers Gang Up On Nvidia's NVSwitch Interconnect. https://www.hpcwire.com/ 2024/05/30/everyone-except-nvidia-forms-ultra-accelerator-link-ualink-consortium/.
- [61] IBM Spectrum MPI. 2021. IBM Spectrum MPI Version 10.2 Release Notes. https://www.ibm.com/docs/en/smpi/10.2?topic=release-notes.
- [62] Harini Muthukrishnan, David Nellans, Daniel Lustig, Jeffrey A. Fessler, and Thomas F. Wenisch. 2021. Efficient Multi-GPU Shared Memory via Automatic Optimization of Fine-Grained Transfers. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). 139–152. https://doi.org/10.1109/ISCA52012.2021.00020
- [63] Naveen Namashivayam, Krishna Kandalla, James B White III au2, Larry Kaplan, and Mark Pagel. 2023. Exploring Fully Offloaded GPU Stream-Aware Message Passing. arXiv:2306.15773 [cs.DC]
- [64] Naveen Namashivayam, Krishna Kandalla, Trey White, Nick Radcliffe, Larry Kaplan, and Mark Pagel. 2022. Exploring GPU Stream-Aware Message Passing using Triggered Operations. arXiv:2208.04817 [cs.DC]
- [65] NVIDIA. [n. d.]. https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html#racecheck-tool
- [66] NVIDIA. 2011. CUDA 4.0 Release Notes. https://developer.nvidia.com/cuda-toolkit-40.
- [67] NVIDIA. 2012. NVIDIA GPUDirect™ Technology. https://developer.download.nvidia.com/devzone/devcenter/cuda/docs/GPUDirect\_Technology\_ Overview.pdf.
- [68] NVIDIA. 2016. Fast Multi-GPU collectives with NCCL. https://developer.nvidia.com/blog/fast-multi-gpu-collectives-nccl/.
- [69] NVIDIA. 2017. CUDA 4.1 Release Notes. https://developer.nvidia.com/cuda-toolkit-41-archive.
- [70] NVIDIA. 2017. NVIDIA DGX-1 With Tesla V100 System Architecture. https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.
- [71] NVIDIA. 2021. Improving GPU Memory Oversubscription Performance. https://developer.nvidia.com/blog/improving-gpu-memory-oversubscription-performance/.
- [72] NVIDIA. 2023. CUDA Programming Guide Release 12.2. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.
- [73] NVIDIA. 2023. CUDA Runtime Device Management. https://docs.nvidia.com/cuda/cuda-runtime-api/group\_CUDART\_DEVICE.html#group\_CUDART\_DEVICE.
- [74] NVIDIA. 2023. DGX-2. https://www.nvidia.com/en-gb/data-center/dgx-2/.
- [75] NVIDIA. 2023. GPUDirect RDMA. https://docs.nvidia.com/cuda/gpudirect-rdma/.
- [76] NVIDIA. 2023. Magnum IO GDRCopy. https://developer.nvidia.com/gdrcopy.
- [77] NVIDIA. 2023. NCCL. https://developer.nvidia.com/nccl.
- [78] NVIDIA. 2023. NVIDIA GPUDirect Family. https://developer.nvidia.com/gpudirect.
- [79] NVIDIA. 2023. NVLink and NVSwitch. https://www.nvidia.com/en-us/data-center/nvlink/.
- [80] NVIDIA. 2023. NVSHMEM. https://developer.nvidia.com/nvshmem.

- [81] NVIDIA. 2023. NVSHMEM 2.7.0 Release Notes. https://docs.nvidia.com/nvshmem/release-notes/release-270.html#release-270.
- [82] NVIDIA. 2024. NVIDIA GB200. https://www.nvidia.com/en-us/data-center/gb200-nvl72/.
- [83] NVIDIA. 2024. NVIDIA Grace Hopper Superchip. https://resources.nvidia.com/en-us-grace-cpu/nvidia-grace-hopper.
- [84] NVIDIA. 2024. NVIDIA NVLink SGXLS10 Switch Systems User Manual. https://docs.nvidia.com/networking/display/sgxh100/introduction.
- [85] Lena Oden and Holger Fröning. 2013. GGAS: Global GPU address spaces for efficient communication in heterogeneous clusters. 2013 IEEE International Conference on Cluster Computing (CLUSTER) (2013), 1–8.
- [86] Lena Oden, Holger Fröning, and Franz-Joseph Pfreundt. 2014. Infiniband-Verbs on GPU: A Case Study of Controlling an Infiniband Network Device from the GPU. In 2014 IEEE International Parallel & Distributed Processing Symposium Workshops. 976–983. https://doi.org/10.1109/IPDPSW.2014.111
- [87] Lena Oden, Benjamin Klenk, and Holger Fröning. 2014. Energy-Efficient Collective Reduce and Allreduce Operations on Distributed GPUs. In 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. 483–492. https://doi.org/10.1109/CCGrid.2014.21
- [88] Lena Oden, Benjamin Klenk, and Holger Fröning. 2014. Energy-Efficient Stencil Computations on Distributed GPUs Using Dynamic Parallelism and GPU-Controlled Communication. In 2014 Energy Efficient Supercomputing Workshop. 31–40. https://doi.org/10.1109/E2SC.2014.14
- [89] OpenMPI. 2023. Open MPI v5.0.x Documentation: CUDA. https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/cuda.html.
- [90] OpenMPI. 2023. Open MPI v5.0.x Documentation: ROCm. https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/rocm.html.
- [91] Sreeram Potluri Pak Markthub, Jim Dinan and Seth Howell. 2022. Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async. https://developer.nvidia.com/blog/improving-network-performance-of-hpc-systems-using-nvidia-magnum-io-nvshmem-and-gpudirect-async/.
- [92] Carl Pearson. 2023. Interconnect Bandwidth Heterogeneity on AMD MI250x and Infinity Fabric. arXiv:2302.14827 [cs.DC]
- [93] Carl Pearson, Abdul Dakkak, Sarah Hashash, Cheng Li, I-Hsin Chung, Jinjun Xiong, and Wen-Mei Hwu. 2019. Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects. In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (Mumbai, India) (ICPE '19). Association for Computing Machinery, New York, NY, USA, 209–218. https://doi.org/10.1145/3297663. 3310299
- [94] Sreeram Potluri, Anshuman Goswami, Davide Rossetti, C.J. Newburn, Manjunath Gorentla Venkata, and Neena Imam. 2017. GPU-Centric Communication on NVIDIA GPU Clusters with InfiniBand: A Case Study with OpenSHMEM. In 2017 IEEE 24th International Conference on High Performance Computing (HiPC). 253–262. https://doi.org/10.1109/HiPC.2017.00037
- [95] Sreeram Potluri, Anshuman Goswami, Manjunath Gorentla Venkata, and Neena Imam. 2018. Efficient Breadth First Search on Multi-GPU Systems Using GPU-Centric OpenSHMEM. In OpenSHMEM and Related Technologies. Big Compute and Big Data Convergence, Manjunath Gorentla Venkata, Neena Imam, and Swaroop Pophale (Eds.). Springer International Publishing, Cham, 82–96.
- [96] Sreeram Potluri, Khaled Hamidouche, Akshay Venkatesh, Devendar Bureddy, and Dhabaleswar K. Panda. 2013. Efficient Inter-node MPI Communication Using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs. In 2013 42nd International Conference on Parallel Processing. 80–89. https://doi.org/10.1109/ICPP.2013.17
- [97] Sreeram Potluri, Davide Rossetti, Donald Becker, Duncan Poole, Manjunath Gorentla Venkata, Oscar Hernandez, Pavel Shamis, M. Graham Lopez, Mathew Baker, and Wendy Poole. 2015. Exploring OpenSHMEM Model to Program GPU-Based Extreme-Scale Systems. In Revised Selected Papers of the Second Workshop on OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies Volume 9397 (Annapolis, MD, USA) (OpenSHMEM 2015). Springer-Verlag, Berlin, Heidelberg, 18–35. https://doi.org/10.1007/978-3-319-26428-8\_2
- [98] S. Potluri, H. Wang, D. Bureddy, A.K. Singh, C. Rosales, and Dhabaleswar K. Panda. 2012. Optimizing MPI Communication on Multi-GPU Systems Using CUDA Inter-Process Communication. In 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. 1848–1857. https://doi.org/10.1109/IPDPSW.2012.228
- [99] Kishore Punniyamurthy, Bradford M. Beckmann, and Khaled Hamidouche. 2023. GPU-initiated Fine-grained Overlap of Collective Communication with Computation. arXiv:2305.06942 [cs.DC]
- [100] Kishore Punniyamurthy, Khaled Hamidouche, and Bradford M Beckmann. 2023. Optimizing Distributed ML Communication with Fused Computation-Collective Operations. arXiv:2305.06942
- [101] Zaid Qureshi, Vikram Sharma Mailthody, Isaac Gelado, Seungwon Min, Amna Masood, Jeongmin Park, Jinjun Xiong, C. J. Newburn, Dmitri Vainbrand, I-Hsin Chung, Michael Garland, William Dally, and Wen-mei Hwu. 2023. GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (Vancouver, BC, Canada) (ASPLOS 2023). Association for Computing Machinery, New York, NY, USA, 325–339. https://doi.org/10.1145/3575693.3575748
- [102] Davide Rossetti, Sreeram Potluri, and David Fontaine. 2016. State of GPUDirect Technologies. https://on-demand.gputechconf.com/gtc/2016/presentation/s6264-davide-rossetti-GPUDirect.pdf.
- [103] Tim C. Schroeder. 2011. Peer-to-Peer & Unified Virtual Addressing. https://developer.download.nvidia.com/CUDA/training/cuda\_webinars\_ GPUDirect\_uva.pdf.
- [104] Kawthar Shafie Khorassani, Jahanzeb Hashmi, Ching-Hsiang Chu, Chen-Chun Chen, Hari Subramoni, and Dhabaleswar K. Panda. 2021. Designing a ROCm-Aware MPI Library for AMD GPUs: Early Experiences. In High Performance Computing, Bradford L. Chamberlain, Ana-Lucia Varbanescu, Hatem Ltaief, and Piotr Luszczek (Eds.). Springer International Publishing, Cham, 118–136.
- [105] Agam Shah. 2024. Intel's Next-gen Falcon Shores Coming Out in Late 2025. https://www.hpcwire.com/2024/04/30/intels-next-gen-falcon-shores-coming-out-in-late-2025/.

[106] Aashaka Shah, Vijay Chidambaram, Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi, and Rachee Singh. 2022. TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches. arXiv:2111.04867 [cs.DC]

- [107] Gilad Shainer, Ali Ayoub, Pak Lui, Tong Liu, Michael Kagan, Christian R. Trott, Greg Scantlen, and Paul S. Crozier. 2011. The development of Mellanox/NVIDIA GPUDirect over InfiniBand—a new model for GPU to GPU communications. Computer Science - Research and Development 26, 3 (01 Jun 2011), 267–273. https://doi.org/10.1007/s00450-011-0157-1
- [108] Pavel Shamis, Manjunath Gorentla Venkata, M Graham Lopez, Matthew B Baker, Oscar Hernandez, Yossi Itigin, Mike Dubman, Gilad Shainer, Richard L Graham, Liran Liss, et al. 2015. UCX: an open source framework for HPC network APIs and beyond. In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects. IEEE, 40–43.
- [109] Chuanming Shao, Jinyang Guo, Pengyu Wang, Jing Wang, Chao Li, and Minyi Guo. 2022. Oversubscribing GPU Unified Virtual Memory: Implications and Suggestions. In Proceedings of the 2022 ACM/SPEC on International Conference on Performance Engineering (Beijing, China) (ICPE '22). Association for Computing Machinery, New York, NY, USA, 67–75. https://doi.org/10.1145/3489525.3511691
- [110] Rong Shi, Sreeram Potluri, Khaled Hamidouche, Jonathan Perkins, Mingzhe Li, Davide Rossetti, and Dhabaleswar K. D K Panda. 2014. Designing efficient small message transfer mechanism for inter-node MPI communication on InfiniBand GPU clusters. In 2014 21st International Conference on High Performance Computing (HiPC). 1–10. https://doi.org/10.1109/HiPC.2014.7116873
- [111] Takashi Shimokawabe, Takayuki Aoki, Tomohiro Takaki, Toshio Endo, Akinori Yamanaka, Naoya Maruyama, Akira Nukada, and Satoshi Matsuoka.
  2011. Peta-Scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (Seattle, Washington) (SC '11). Association for Computing Machinery, New York, NY, USA, Article 3, 11 pages. https://doi.org/10.1145/2063384.2063388
- [112] Mark Silberstein, Bryan Ford, Idit Keidar, and Emmett Witchel. 2014. GPUfs: Integrating a File System with GPUs. ACM Trans. Comput. Syst. 32, 1, Article 1 (feb 2014), 31 pages. https://doi.org/10.1145/2553081
- [113] Mark Silberstein, Sangman Kim, Seonggu Huh, Xinya Zhang, Yige Hu, Amir Wated, and Emmett Witchel. 2016. GPUnet: Networking Abstractions for GPU Programs. 34, 3, Article 9 (sep 2016), 31 pages. https://doi.org/10.1145/2963098
- [114] Jeff A. Stuart, Michael Cox, and John D. Owens. 2010. GPU-to-CPU Callbacks. In Proceedings of the 2010 Conference on Parallel Processing (Ischia, Italy) (Euro-Par 2010). Springer-Verlag, Berlin, Heidelberg, 365–372.
- [115] Serve the Home. 2024. Intel Ponte Vecchio Spaceship GPU No Longer Hunting New Clusters. https://www.servethehome.com/intel-ponte-vecchio-spaceship-gpu-no-longer-hunting-new-clusters/.
- [116] The Unified Communication X Library [n. d.]. The Unified Communication X Library. http://www.openucx.org.
- [117] Adam Thompson and Newburn C.J. 2012. GPUDirect Storage: A Direct Path Between Storage and GPU Memory. https://developer.nvidia.com/blog/gpudirect-storage/.
- [118] Christian Trott. 2018. Early Experience with NVSHMEM: Extending the Kokkos Programming Model with PGAS Semantics. https://www.osti.gov/servlets/purl/1806950.
- [119] Akshay Venkatesh, Khaled Hamidouche, Sreeram Potluri, Davide Rosetti, Ching-Hsiang Chu, and Dhabaleswar K. Panda. 2017. MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling. In 2017 46th International Conference on Parallel Processing (ICPP). 151–160. https://doi.org/10.1109/ICPP.2017.24
- [120] Ján Veselý, Arkaprava Basu, Abhishek Bhattacharjee, Gabriel H. Loh, Mark Oskin, and Steven K. Reinhardt. 2018. Generic System Calls for GPUs. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). 843–856. https://doi.org/10.1109/ISCA.2018.00075
- [121] Mathias Wagner. 2020. GTC 20: Overcoming Latency Barriers: Strong Scaling HPC Applications with NVSHMEM. https://www.nvidia.com/en-us/on-demand/session/gtcsj20-s21673/.
- [122] Guanhua Wang, Shivaram Venkataraman, Amar Phanishayee, Jorgen Thelin, Nikhil Devanur, and Ion Stoica. 2019. Blink: Fast and Generic Collectives for Distributed ML. arXiv:1910.04940 [cs.DC]
- [123] Hao Wang, Sreeram Potluri, Devendar Bureddy, Carlos Rosales, and Dhabaleswar K. Panda. 2014. GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation. IEEE Transactions on Parallel and Distributed Systems 25, 10 (2014), 2595–2605. https://doi.org/10.1109/TPDS.2013.222
- [124] Hao Wang, Sreeram Potluri, Miao Luo, Ashish Singh, Sayantan Sur, and D.K. Panda. 2011. MVAPICH2GPU: optimized GPU to GPU communication for InfiniBand clusters. Computer Science - Research and Development 26 (06 2011), 257–266. https://doi.org/10.1007/s00450-011-0171-3
- [125] Hao Wang, Sreeram Potluri, Miao Luo, Ashish Kumar Singh, Xiangyong Ouyang, Sayantan Sur, and Dhabaleswar K. Panda. 2011. Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2. In 2011 IEEE International Conference on Cluster Computing. 308–316. https://doi.org/10.1109/CLUSTER.2011.42
- [126] Yuke Wang, Boyuan Feng, Zheng Wang, Tong Geng, Ang Li, Kevin Barker, and Yufei Ding. 2023. MGG: Accelerating Graph Neural Networks with Fine-grained intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms. In USENIX Symposium on Operating Systems Design and Implementation (OSDI'21).
- [127] Adam Weingram, Yuke Li, Hao Qi, Darren Ng, Liuyao Dai, and Xiaoyi Lu. 2023. xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning. Journal of Computer Science and Technology 38, 1 (01 Feb 2023), 166–195. https://doi.org/10.1007/s11390-023-2894-6
- [128] Wei Wu, George Bosilca, Rolf vandeVaart, Sylvain Jeaugey, and Jack Dongarra. 2016. GPU-Aware Non-Contiguous Data Movement In Open MPI. In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (Kyoto, Japan) (HPDC '16). Association for Computing Machinery, New York, NY, USA, 231–242. https://doi.org/10.1145/2907294.2907317

- [129] CHENHAO XIE, Jieyang Chen, Jesun Firoz, Jiajia Li, Shuaiwen Leon Song, Kevin Barker, Mark Raugas, and Ang Li. 2021. Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures. In Proceedings of the 50th International Conference on Parallel Processing (Lemont, IL, USA) (ICPP '21). Association for Computing Machinery, New York, NY, USA, Article 53, 11 pages. https://doi.org/10.1145/3472456.3472478
- [130] Junchao Zhang, Jed Brown, Satish Balay, Jacob Faibussowitsch, Matthew Knepley, Oana Marin, Richard Tran Mills, Todd Munson, Barry F. Smith, and Stefano Zampini. 2021. The PetscSF Scalable Communication Layer. arXiv:2102.13018 [cs.DC]